Authors: Daniela Cassol (danielac@ucr.edu), Le Zhang (le.zhang001@email.ucr.edu), Thomas Girke (thomas.girke@ucr.edu).
Institution: Institute for Integrative Genome Biology, University of California, Riverside, California, USA.
systemPipe (SP) is a generic toolkit for designing and running reproducible data
analysis workflows. The environment consists of three major modules implemented
as R/Bioconductor packages:
systemPipeR (SPR) provides core functionalities for defining workflows,
interacting with command-line software, executing both R and command-line
software, and generating publication-quality analysis reports.
systemPipeShiny (SPS) integrates a graphical user interface for managing
workflows and visualizing results interactively.
systemPipeWorkflow (SPW) offers a collection of pre-configured workflow templates.
A central concept for designing workflows within the systemPipeR environment is
the use of workflow management containers. systemPipeR adopted the widely used
community standard Common Workflow Language (CWL)
(Amstutz et al. 2016) for describing analysis workflows in a generic and reproducible
manner.
Using this community standard in systemPipeR has many advantages. For instance,
the integration of CWL allows running systemPipeR workflows from a single
specification instance either entirely from within R, from various command-line
wrappers (e.g. cwl-runner) or from other languages (e.g. Bash or Python).
systemPipeR includes support for both command-line and R/Bioconductor software
as well as resources for containerization, parallel evaluations on computer
clusters along with automated generation of interactive analysis reports.
An important feature of systemPipeR's CWL interface is that it provides two
options to run command-line tools and workflows based on CWL.
First, one can run CWL in its native way via an R-based wrapper utility for
cwl-runner or cwltool (CWL-based approach). Second, one can run workflows
using CWL’s command-line and workflow instructions from within R (R-based approach).
In the latter case the same CWL workflow definition files (e.g. .cwl and .yml)
are used but rendered and executed entirely with R functions defined by systemPipeR,
thus using CWL mainly as a command-line and workflow definition format rather
than as software to run workflows. Moreover, systemPipeR provides several
convenience functions that are useful for designing and debugging workflows,
such as a command-line rendering function to retrieve the exact command-line
strings for each step prior to execution.
Figure 1: systemPipeR’s preconfigured directory structure.
This workshop uses R 4.1.0 and Bioconductor version 3.14. Bioconductor can be
installed following these instructions.
During the Bioc2021 conference, the workshop can be run in the cloud.
The Docker container used by this workshop runs with Bioconductor’s development
version 3.14. It includes all the necessary packages and software for running
the code in the workshop vignettes.
To use the Docker container, Docker first needs to be installed on the user’s system.
docker run -e PASSWORD=systempipe -p 8787:8787 systempipe/systempipeworkshop2021:latest
Log in to RStudio at http://localhost:8787 using username rstudio
and password systempipe.
If you prefer to run the workshop from the command-line:
docker run -it --user rstudio systempipe/systempipeworkshop2021:latest bash
The systemPipeR and
systemPipeShiny
environments can be installed from the R console using the BiocManager::install
command. The associated data package systemPipeRdata
can be installed the same way. The latter is a helper package for generating systemPipeR
workflow environments with a single command, containing all parameter files and
sample data required to quickly test and run workflows.
To install all packages required for this workshop on a local system, one can use the following install commands.
## Install workshop package
BiocManager::install("systemPipeR/systemPipeWorkshop2021")
## Install required packages
BiocManager::install(c("systemPipeR", "systemPipeRdata", "systemPipeShiny"), version="3.14")
To access the vignette:
browseVignettes(package = "systemPipeWorkshop2021")
library("systemPipeR")
library("systemPipeShiny")
library("systemPipeRdata")
library(help="systemPipeR") # Lists package info
vignette("systemPipeR") # Opens vignette
systemPipeRdata::genWorkenvir("rnaseq", mydirname = "bioc2021")
#> [1] "Generated bioc2021 directory. Next run in rnaseq directory, the R code from *.Rmd template interactively. Alternatively, workflows can be exectued with a single command as instructed in the vignette."
All questions about the package or any particular function should be posted to
the Bioconductor support site https://support.bioconductor.org.
Please add the “systemPipeR” tag to your question. This triggers an email
alert that will be sent to the authors.
We also appreciate receiving suggestions for improvements and/or bug reports by opening issues
on GitHub.
Figure: the central sal workflow container and its core operations — import, run, plot.
systemPipeR expects a project directory structure that consists of a data directory
where users store all the raw data, a results directory reserved
for all output files and new output folders, and a param directory.
This structure supports reproducibility and collaboration across a data science team since relative paths are used internally: users can transfer the project to a different location and still run the entire workflow. It also improves efficiency and data management, since the raw data is kept in a separate folder, avoiding duplication.
systemPipeRdata,
a helper package, provides pre-configured workflows, reporting
templates, and sample data, loaded as demonstrated below. With a single command,
the package creates the workflow environment containing the structure
described here (see Figure 1).
Directory names are indicated in green. Users can change this structure as needed, but need to adjust the code in their workflows accordingly.
*.cwl and *input.yml files need to be in the same subdirectory.
Figure 1: systemPipeR’s preconfigured directory structure.
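As a textual sketch of this figure, based on the directory names used in this vignette (data, param, results; file names shown are illustrative placeholders):

```
workflow_project/          # e.g. generated by genWorkenvir()
├── data/                  # raw data (e.g. FASTQ files, reference genome)
├── param/                 # parameter files
│   └── cwl/               # *.cwl and *input.yml files (same subdirectory)
├── results/               # all output files and new output folders
└── *.Rmd                  # workflow report template
```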
targets file
The targets file defines all input files (e.g. FASTQ, BAM, BCF) and sample
comparisons of an analysis workflow. The following shows the format of a sample
targets file included in the package. It also can be viewed and downloaded
from systemPipeR’s GitHub repository here.
In a targets file with a single type of input file, here FASTQ files of
single-end (SE) reads, the first column defines the paths and the second column
provides a unique ID name for each sample. The third column, called Factor,
represents the biological replicates. All subsequent columns are optional and provide
additional information; any number of additional columns can be added as needed.
Note that the usage of targets files is optional when using
systemPipeR's new workflow management interface; they can be replaced by a standard YAML
input file used by CWL. However, since targets files are extremely useful and
user-friendly for organizing experimental variables, we encourage users to keep using
them.
targets file for single-end (SE) samples
targetspath <- "targets.txt"
showDF(read.delim(targetspath, comment.char = "#"))
To work with custom data, users need to generate a targets file containing
the paths to their own FASTQ files and then provide, under targetspath, the
path to the corresponding targets file.
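As a minimal sketch (all FASTQ paths, sample names, and factor levels below are hypothetical placeholders), such a targets file can be generated with base R:

```r
## Minimal sketch of a custom targets file for single-end FASTQ samples.
## The FASTQ paths, sample names, and factor levels are hypothetical.
targets <- data.frame(
    FileName   = c("data/sampleA_rep1.fastq.gz", "data/sampleA_rep2.fastq.gz"),
    SampleName = c("A1", "A2"),
    Factor     = c("A", "A")
)
write.table(targets, "targets_custom.txt", sep = "\t",
            quote = FALSE, row.names = FALSE)
targetspath <- "targets_custom.txt"  # then point targetspath to this file
```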
targets file for “Hello World” example
In this example, the targets file contains only two columns. The first column
contains short text strings that will be used by the echo command-line tool. The second column
contains sample IDs. The ID column is required, and each sample ID must be unique.
targetspath <- system.file("extdata/cwl/example/targets_example.txt", package = "systemPipeR")
showDF(read.delim(targetspath, comment.char = "#"))
The parameters required for running command-line software are provided by adopting the widely used Common Workflow Language (CWL) community standard (Amstutz et al. 2016). Parameter files are only required for command-line steps; for R-based workflow steps they are optional. An overview of the CWL syntax is provided in the section below, while this section explains how targets files can be used for CWL-based workflow steps.
Users need to define the command-line in a pseudo-bash script format:
# "hisat2 -S ./results/M1A.sam -x ./data/tair10.fasta -k 1 -threads 4 -U ./data/SRR446027_1.fastq.gz "
command <- "
hisat2 \
-S <F, out: ./results/M1A.sam> \
-x <F: ./data/tair10.fasta> \
-k <int: 1> \
-threads <int: 4> \
-U <F: ./data/SRR446027_1.fastq.gz>
"
The first line is the base command. Each subsequent line is an argument with its default value.
For argument lines (starting from the second line), any word before the first
space with a leading - or -- will be treated as a prefix, like -S or
--min. Any line without such a first word will be treated as having no prefix.
All defaults are placed inside <...>.
The first entry is the input argument type: F stands for “File”; “int” and “string” are used as-is.
Optional: use the keyword out, separated from the type by a comma, to
indicate that this argument is also a CWL output.
Then use : to separate keywords and default values; any non-space value after the :
will be treated as the default value.
If an argument has no default value and is just a flag, like --verbose, there is no need to add any <...>.
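Combining these rules, a hypothetical command string (the tool and argument names are made up for illustration) with a File input, a File output marked with out, an int with a default, and a bare flag would look like:

```
mytool \
    --input <F: ./data/sample.txt> \
    --out-file <F, out: ./results/sample_out.txt> \
    --cutoff <int: 5> \
    --verbose
```

Here --verbose carries no <...> block because it is a flag without a default value.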
createParam function
The createParam function requires the string defined above as input.
First, the function will print the three components of the CWL file:
- BaseCommand: Specifies the program to execute.
- Inputs: Defines the input parameters of the process.
- Outputs: Defines the parameters representing the output of the process.
The fourth component is the parsed raw command-line.
In interactive mode, the function will verify that everything is correct and ask whether to proceed. Here, the user can answer “no” and provide more information at the string level. A second prompt asks whether to save the param files created here.
When running the workflow in non-interactive mode, the createParam function will
assume “yes” and return the container.
cmd <- createParam(command, writeParamFiles = TRUE)
#> *****BaseCommand*****
#> hisat2
#> *****Inputs*****
#> S:
#> type: File
#> preF: -S
#> yml: ./results/M1A.sam
#> x:
#> type: File
#> preF: -x
#> yml: ./data/tair10.fasta
#> k:
#> type: int
#> preF: -k
#> yml: 1
#> threads:
#> type: int
#> preF: -threads
#> yml: 4
#> U:
#> type: File
#> preF: -U
#> yml: ./data/SRR446027_1.fastq.gz
#> *****Outputs*****
#> output1:
#> type: File
#> value: ./results/M1A.sam
#> *****Parsed raw command line*****
#> hisat2 -S ./results/M1A.sam -x ./data/tair10.fasta -k 1 -threads 4 -U ./data/SRR446027_1.fastq.gz
#> Written content of 'commandLine' to file:
#> param/cwl/hisat2/hisat2.cwl
#> Written content of 'commandLine' to file:
#> param/cwl/hisat2/hisat2.yml
systemPipeR
This section will demonstrate how to connect CWL parameter files to create
workflows. In addition, we will show how easily the workflow can be scaled
with systemPipeR.
Figure 2: Connectivity between CWL param files and targets files.
To create a Workflow within systemPipeR, we can start by defining an empty
container and checking the directory structure:
sal <- SPRproject(projPath = getwd())
#> Creating directory '/home/dcassol/danielac@ucr.edu/projects/BioC2021_Workshop/systemPipeWorkshop2021/vignettes/bioc2021/.SPRproject'
#> Creating file '/home/dcassol/danielac@ucr.edu/projects/BioC2021_Workshop/systemPipeWorkshop2021/vignettes/bioc2021/.SPRproject/SYSargsList.yml'
Internally, the SPRproject function will create a hidden folder called .SPRproject,
by default, to store all the log files.
A YAML file, here called SYSargsList.yml, has been created, which initially
contains the basic location of the project structure; however, every time the
workflow object sal is updated in R, the new information will also be stored in this
flat-file database for easy recovery.
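Because the project state is kept in this flat-file database, a previously initialized project can be recovered rather than re-created. A minimal sketch, assuming the resume argument of SPRproject (available in recent systemPipeR releases) and an existing .SPRproject folder:

```r
## Sketch: recover a prior project from .SPRproject/SYSargsList.yml
## instead of initializing a new one ('resume' is assumed available).
sal <- SPRproject(resume = TRUE)
```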
If you prefer different names for the logs folder and the YAML file, these can
be modified as follows:
sal <- SPRproject(logs.dir= ".SPRproject", sys.file=".SPRproject/SYSargsList.yml")
Also, this function will check and/or create the basic folder structure if missing,
i.e. the data, param, and results folders, as described here.
If the user wants to use different names for these directories, they can be specified
as follows:
sal <- SPRproject(data = "data", param = "param", results = "results")
It is possible to separate all the R objects created within the workflow analysis
from the current environment. The SPRproject function provides the option to create
a new environment so that no objects in your current session are overwritten.
sal <- SPRproject(envir = new.env())
At this stage, the object sal is an empty container, except for the project information, which can be accessed with the projectInfo method:
sal
#> Instance of 'SYSargsList':
#> No workflow steps added
projectInfo(sal)
#> $project
#> [1] "/home/dcassol/danielac@ucr.edu/projects/BioC2021_Workshop/systemPipeWorkshop2021/vignettes/bioc2021"
#>
#> $data
#> [1] "data"
#>
#> $param
#> [1] "param"
#>
#> $results
#> [1] "results"
#>
#> $logsDir
#> [1] ".SPRproject"
#>
#> $sysargslist
#> [1] ".SPRproject/SYSargsList.yml"
Also, the length function will return how many steps the workflow contains; in
this case it is empty, as follows:
length(sal)
#> [1] 0
systemPipeR workflows can be designed and built from start to finish with a
single command, importing from an R Markdown file or stepwise in interactive
mode from the R console.
In the next section, we will demonstrate how to build the workflow in an
interactive mode, and in the following section, we will show how to build from a
file.
New workflows are constructed, or existing ones modified, by connecting each
step via the appendStep method. Each SYSargsList instance contains instructions
needed for processing a set of input files with a specific command-line or R
software, as well as the paths to the corresponding outfiles generated by a
particular tool/step.
To build an R-code-based step, the constructor function LineWise is used.
For more details about this S4 class container, see here.
This tutorial presents a simple example to describe and explain the main features available within systemPipeR to design, build, manage, run, and visualize workflows. In summary, we export a dataset to multiple files, compress and decompress each of the files, import them back into R, and finally perform a statistical analysis.
In the previous section, we initialized the project by building the sal object.
So far, the container has no steps:
sal
#> Instance of 'SYSargsList':
#> No workflow steps added
Next, we need to populate the object with the first step in the workflow.
The first step is R-code-based: we split the iris dataset by Species
and save each subset to a file. Please note that this code will
not be executed now; it is just stored in the container for later execution.
This constructor function requires the step_name and the R code under
the code argument.
The R code should be enclosed by braces ({}) and separated by new lines.
appendStep(sal) <- LineWise(code = {
mapply(function(x, y) write.csv(x, y),
split(iris, factor(iris$Species)),
file.path("results", paste0(names(split(iris, factor(iris$Species))), ".csv"))
)
},
step_name = "export_iris")
For a brief overview of the workflow, we can check the object as follows:
sal
#> Instance of 'SYSargsList':
#> WF Steps:
#> 1. export_iris --> Status: Pending
#>
Also, to print and double-check the R code in the step, we can use the
codeLine method:
codeLine(sal)
#> export_iris
#> mapply(function(x, y) write.csv(x, y), split(iris, factor(iris$Species)), file.path("results", paste0(names(split(iris, factor(iris$Species))), ".csv")))
Next is an example of how to compress the exported files using the
gzip command-line tool.
The constructor function creates an SYSargsList S4 class object using data from
three input files:
- CWL command-line specification file (`wf_file` argument);
- Input variables (`input_file` argument);
- Targets file (`targets` argument).
In CWL, files with the extension .cwl define the parameters of a chosen
command-line step or workflow, while files with the extension .yml define the
input variables of command-line steps.
The targets file is optional for workflow steps lacking input files. The connection
between input variables and the targets file is defined under the inputvars
argument. It requires a named vector, where each element name must match
a column name in the targets file, and the value must match the name of
an input variable defined in the *.yml file (see Figure 2).
A detailed description of the dynamic between input variables and targets
files can be found here.
In addition, the CWL syntax overview can be found here.
Besides the targets, wf_file, input_file, and dir_path arguments, the
SYSargsList constructor function options include:
- step_name: a unique name for the step. This is not mandatory; however, it is highly recommended. If no name is provided, a default step_x, where x reflects the step index, will be added.
- dir: this option allows creating an exclusive subdirectory for the step in the workflow. All the outfiles and log files for this particular step will be generated in the respective folder.
- dependency: after the first step, all additional steps appended to the workflow require the information of the dependency tree.
The appendStep<- method is used to append a new step to the workflow.
targetspath <- system.file("extdata/cwl/gunzip", "targets_gunzip.txt", package = "systemPipeR")
appendStep(sal) <- SYSargsList(step_name = "gzip",
targets = targetspath, dir = TRUE,
wf_file = "gunzip/workflow_gzip.cwl", input_file = "gunzip/gzip.yml",
dir_path = system.file("extdata/cwl", package = "systemPipeR"),
inputvars = c(FileName = "_FILE_PATH_", SampleName = "_SampleName_"),
dependency = "export_iris")
Note: This will not work if gzip is not available on your system
(i.e. installed and exported to PATH), and on Windows it may only work when using PowerShell.
For an overview of the workflow, we can check the object as follows:
sal
#> Instance of 'SYSargsList':
#> WF Steps:
#> 1. export_iris --> Status: Pending
#> 2. gzip --> Status: Pending
#> Total Files: 3 | Existing: 0 | Missing: 3
#> 2.1. gzip
#> cmdlist: 3 | Pending: 3
#>
Note that we now have two steps, and three files are expected from the second step. Also, the workflow status is Pending, which means the workflow object is rendered in R, but we have not executed the workflow yet. In addition to this summary, it can be observed that this step has three command-lines.
For more details about the command-line rendered for each target file, it can be checked as follows:
cmdlist(sal, step="gzip")
#> $gzip
#> $gzip$SE
#> $gzip$SE$gzip
#> [1] "gzip -c results/setosa.csv > results/SE.csv.gz"
#>
#>
#> $gzip$VE
#> $gzip$VE$gzip
#> [1] "gzip -c results/versicolor.csv > results/VE.csv.gz"
#>
#>
#> $gzip$VI
#> $gzip$VI$gzip
#> [1] "gzip -c results/virginica.csv > results/VI.csv.gz"
outfiles for the next step
To build this step, the same procedure as before is used to append the next step. However, here we can observe powerful features that build the connectivity between steps in the workflow.
In this example, we would like to use the outfiles from the gzip step as input for the next step, gunzip. Let’s look at the outfiles from the previous steps:
outfiles(sal)
#> $export_iris
#> DataFrame with 0 rows and 0 columns
#>
#> $gzip
#> DataFrame with 3 rows and 1 column
#> gzip_file
#> <character>
#> 1 results/SE.csv.gz
#> 2 results/VE.csv.gz
#> 3 results/VI.csv.gz
The column we want to use is “gzip_file”. For the targets argument of the
SYSargsList function, we provide the name of the corresponding step in
the workflow whose outfiles should be incorporated into the next
step.
The inputvars argument establishes the connectivity between the outfiles and the
new targets file. Here, the column names of the previous outfiles should be
provided. Please note that all outfiles column names must be unique.
It is possible to keep all the original columns from the targets file or remove
some columns for a cleaner targets file.
The rm_targets_col argument provides this flexibility: it specifies the
names of the columns that should be removed. If no names are passed
here, the new columns will be appended.
appendStep(sal) <- SYSargsList(step_name = "gunzip",
targets = "gzip", dir = TRUE,
wf_file = "gunzip/workflow_gunzip.cwl", input_file = "gunzip/gunzip.yml",
dir_path = system.file("extdata/cwl", package = "systemPipeR"),
inputvars = c(gzip_file = "_FILE_PATH_", SampleName = "_SampleName_"),
rm_targets_col = "FileName",
dependency = "gzip")
We can check the targets automatically created for this step,
based on the previous outfiles:
targetsWF(sal[3])
#> $gunzip
#> DataFrame with 3 rows and 2 columns
#> gzip_file SampleName
#> <character> <character>
#> 1 results/SE.csv.gz SE
#> 2 results/VE.csv.gz VE
#> 3 results/VI.csv.gz VI
We can also check all the expected outfiles for this particular step, as follows:
outfiles(sal[3])
#> $gunzip
#> DataFrame with 3 rows and 1 column
#> gunzip_file
#> <character>
#> 1 results/SE.csv
#> 2 results/VE.csv
#> 3 results/VI.csv
Now, we can observe that the third step has been added and contains one substep.
sal
#> Instance of 'SYSargsList':
#> WF Steps:
#> 1. export_iris --> Status: Pending
#> 2. gzip --> Status: Pending
#> Total Files: 3 | Existing: 0 | Missing: 3
#> 2.1. gzip
#> cmdlist: 3 | Pending: 3
#> 3. gunzip --> Status: Pending
#> Total Files: 3 | Existing: 0 | Missing: 3
#> 3.1. gunzip
#> cmdlist: 3 | Pending: 3
#>
In addition, we can access all the command-lines for each one of the substeps.
cmdlist(sal["gzip"], targets = 1)
#> $gzip
#> $gzip$SE
#> $gzip$SE$gzip
#> [1] "gzip -c results/setosa.csv > results/SE.csv.gz"
The final step in this simple workflow is an R code step, using
the LineWise constructor function as demonstrated above.
An interesting feature shown here is the getColumn method, which allows
extracting information from a workflow instance. Those files can then be used in
R code, as demonstrated below.
getColumn(sal, step = "gunzip", 'outfiles')
#> SE VE VI
#> "results/SE.csv" "results/VE.csv" "results/VI.csv"
appendStep(sal) <- LineWise(code = {
df <- lapply(getColumn(sal, step = "gunzip", 'outfiles'), function(x) read.delim(x, sep = ",")[-1])
df <- do.call(rbind, df)
stats <- data.frame(cbind(mean = apply(df[,1:4], 2, mean), sd = apply(df[,1:4], 2, sd)))
stats$species <- rownames(stats)
plot <- ggplot2::ggplot(stats, ggplot2::aes(x = species, y = mean, fill = species)) +
ggplot2::geom_bar(stat = "identity", color = "black", position = ggplot2::position_dodge()) +
ggplot2::geom_errorbar(ggplot2::aes(ymin = mean-sd, ymax = mean+sd), width = .2, position = ggplot2::position_dodge(.9))
},
step_name = "iris_stats",
dependency = "gunzip")
Precisely the same workflow can be created by importing the steps from an
R Markdown file.
As demonstrated above, the project first needs to be initialized with the SPRproject function.
The importWF function will scan and import all the R chunks from the R Markdown file
and build all the workflow instances. Each R chunk in the file will be
converted into a workflow step.
sal_rmd <- SPRproject(logs.dir = ".SPRproject_rmd")
#> Creating directory '/home/dcassol/danielac@ucr.edu/projects/BioC2021_Workshop/systemPipeWorkshop2021/vignettes/bioc2021/.SPRproject_rmd'
#> Creating file '/home/dcassol/danielac@ucr.edu/projects/BioC2021_Workshop/systemPipeWorkshop2021/vignettes/bioc2021/.SPRproject_rmd/SYSargsList.yml'
sal_rmd <- importWF(sal_rmd,
file_path = system.file("extdata", "spr_simple_wf.Rmd", package = "systemPipeR"))
#> Reading Rmd file
#>
#> ---- Actions ----
#> Checking chunk SPR option
#> Ignore non-SPR chunks: 17
#> Checking chunk eval values
#> Resolve step names
#> Check duplicated step names
#> Checking chunk dependencies
#> Use the previous step as dependency for steps without 'spr.dep' options: 27
#> Parse chunk code
#> ---- Succes! Create output ----
#> Now importing step 'export_iris'
#> Now importing step 'gzip'
#> Now importing step 'gunzip'
#> Now importing step 'stats'
Let’s explore the workflow to check the steps:
stepsWF(sal_rmd)
#> $export_iris
#> Instance of 'LineWise'
#> Code Chunk length: 1
#>
#> $gzip
#> Instance of 'SYSargs2':
#> Slot names/accessors:
#> targets: 3 (SE...VI), targetsheader: 1 (lines)
#> modules: 0
#> wf: 1, clt: 1, yamlinput: 4 (inputs)
#> input: 3, output: 3
#> cmdlist: 3
#> Sub Steps:
#> 1. gzip (rendered: TRUE)
#>
#>
#>
#> $gunzip
#> Instance of 'SYSargs2':
#> Slot names/accessors:
#> targets: 3 (SE...VI), targetsheader: 1 (lines)
#> modules: 0
#> wf: 1, clt: 1, yamlinput: 4 (inputs)
#> input: 3, output: 3
#> cmdlist: 3
#> Sub Steps:
#> 1. gunzip (rendered: TRUE)
#>
#>
#>
#> $stats
#> Instance of 'LineWise'
#> Code Chunk length: 5
dependency(sal_rmd)
#> $export_iris
#> [1] ""
#>
#> $gzip
#> [1] "export_iris"
#>
#> $gunzip
#> [1] "gzip"
#>
#> $stats
#> [1] "gunzip"
codeLine(sal_rmd)
#> gzip AND gunzip step have been dropped because it is not a LineWise object.
#> export_iris
#> mapply(function(x, y) write.csv(x, y), split(iris, factor(iris$Species)), file.path("results", paste0(names(split(iris, factor(iris$Species))), ".csv")))
#> stats
#> df <- lapply(getColumn(sal, step = "gunzip", "outfiles"), function(x) read.delim(x, sep = ",")[-1])
#> df <- do.call(rbind, df)
#> stats <- data.frame(cbind(mean = apply(df[, 1:4], 2, mean), sd = apply(df[, 1:4], 2, sd)))
#> stats$species <- rownames(stats)
#> plot <- ggplot2::ggplot(stats, ggplot2::aes(x = species, y = mean, fill = species)) + ggplot2::geom_bar(stat = "identity", color = "black", position = ggplot2::position_dodge()) + ggplot2::geom_errorbar(ggplot2::aes(ymin = mean - sd, ymax = mean + sd), width = 0.2, position = ggplot2::position_dodge(0.9))
targetsWF(sal_rmd)
#> $export_iris
#> DataFrame with 0 rows and 0 columns
#>
#> $gzip
#> DataFrame with 3 rows and 2 columns
#> FileName SampleName
#> <character> <character>
#> 1 results/setosa.csv SE
#> 2 results/versicolor.csv VE
#> 3 results/virginica.csv VI
#>
#> $gunzip
#> DataFrame with 3 rows and 2 columns
#> gzip_file SampleName
#> <character> <character>
#> 1 results/SE.csv.gz SE
#> 2 results/VE.csv.gz VE
#> 3 results/VI.csv.gz VI
#>
#> $stats
#> DataFrame with 0 rows and 0 columns
To include a particular code chunk from the R Markdown file in the workflow analysis, please use the following code chunk options:
- `spr='r'`: for code chunks with R code lines;
- `spr='sysargs'`: for code chunks with an `SYSargsList` object;
- `spr.dep=<StepName>`: to specify the previous dependency.
For example:
```{r step_1, eval=TRUE, spr='r', spr.dep='step_0'}
```

```{r step_2, eval=TRUE, spr='sysargs', spr.dep='step_1'}
```
For spr = 'sysargs', the last object assigned must be the SYSargsList object, for example:
targetspath <- system.file("extdata/cwl/example/targets_example.txt", package = "systemPipeR")
HW_mul <- SYSargsList(step_name = "Example",
targets = targetspath,
wf_file = "example/example.cwl", input_file = "example/example.yml",
dir_path = system.file("extdata/cwl", package = "systemPipeR"),
inputvars = c(Message = "_STRING_", SampleName = "_SAMPLE_"))
Also, note that all files or objects required to generate a particular
command-line step must be defined in an imported R code chunk.
The motivation for this is that when R Markdown files are imported, the
spr = 'sysargs' R chunks will be evaluated and stored in the workflow control
class as SYSargsList objects, while the R-code-based chunks (spr = 'r') are not
evaluated; until the workflow is executed, they are stored as expressions.
To run the workflow, the runWF function will execute all the command-lines
stored in the workflow container.
sal <- runWF(sal)
#> Running Step: export_iris
#> Step Status: Success
#> Running Step: gzip
#> ---- Summary ----
#> Targets Total_Files Existing_Files Missing_Files gzip
#> 1 SE 1 1 0 Success
#> 2 VE 1 1 0 Success
#> 3 VI 1 1 0 Success
#>
#> Step Status: Success
#> Running Step: gunzip
#> ---- Summary ----
#> Targets Total_Files Existing_Files Missing_Files gunzip
#> 1 SE 1 1 0 Success
#> 2 VE 1 1 0 Success
#> 3 VI 1 1 0 Success
#>
#> Step Status: Success
#> Running Step: iris_stats
#> Step Status: Success
This essential function allows the user to choose one or multiple steps to be
executed using the steps argument. However, it is necessary to follow the
workflow dependency graph: if a selected step depends on previous step(s) that
were not executed, the execution will fail.
sal <- runWF(sal, steps = c(1,3))
Also, it is possible to force the execution of steps, even if the status of the
step is 'Success' and all the expected outfiles exist.
Another feature of the runWF function is the control of warnings and errors
during execution via the warning.stop and error.stop arguments, respectively.
sal <- runWF(sal, force = TRUE, warning.stop = FALSE, error.stop = TRUE)
When the project was initialized with the SPRproject function, an environment
was created for all objects generated during the workflow execution. This
environment can be accessed as follows:
viewEnvir(sal)
The workflow execution allows saving this environment for future recovery:
sal <- runWF(sal, saveEnv = TRUE)
To check the summary of the workflow, we can use:
sal
#> Instance of 'SYSargsList':
#> WF Steps:
#> 1. export_iris --> Status: Success
#> 2. gzip --> Status: Success
#> Total Files: 3 | Existing: 3 | Missing: 0
#> 2.1. gzip
#> cmdlist: 3 | Success: 3
#> 3. gunzip --> Status: Success
#> Total Files: 3 | Existing: 3 | Missing: 0
#> 3.1. gunzip
#> cmdlist: 3 | Success: 3
#> 4. iris_stats --> Status: Success
#>
To access more details about the workflow instances, we can use the statusWF method:
statusWF(sal)
#> $export_iris
#> DataFrame with 1 row and 2 columns
#> Step Status
#> <character> <character>
#> 1 export_iris Success
#>
#> $gzip
#> DataFrame with 3 rows and 5 columns
#> Targets Total_Files Existing_Files Missing_Files gzip
#> <character> <numeric> <numeric> <numeric> <list>
#> 1 SE 1 1 0 Success
#> 2 VE 1 1 0 Success
#> 3 VI 1 1 0 Success
#>
#> $gunzip
#> DataFrame with 3 rows and 5 columns
#> Targets Total_Files Existing_Files Missing_Files gunzip
#> <character> <numeric> <numeric> <numeric> <list>
#> 1 SE 1 1 0 Success
#> 2 VE 1 1 0 Success
#> 3 VI 1 1 0 Success
#>
#> $iris_stats
#> DataFrame with 1 row and 2 columns
#> Step Status
#> <character> <character>
#> 1 iris_stats Success
systemPipeR workflow instances can be visualized with the plotWF function.
This function plots the selected workflow instance, displaying the following information:
- Workflow structure (dependency graph between the different steps);
- Workflow step status, *e.g.* `Success`, `Error`, `Pending`, `Warnings`;
- Sample status and statistics;
- Workflow timing: running duration time.
If no arguments are provided, the plot will automatically detect width, height, layout, plot method, branches, etc.
plotWF(sal, show_legend = TRUE, width = "80%", rstudio = TRUE)
For more details, please see the plotWF function documentation.
systemPipeR compiles all the workflow execution logs in one central location,
making it easier to check the standard output (stdout) or standard error
(stderr) of any command-line tools used in the workflow, as well as the stdout
of the R code. In addition, the workflow plot is appended at the beginning of
the report, making it easier to click through to the respective step.
sal <- renderLogs(sal)
This section of the tutorial provides an introduction to using systemPipeR
features on a cluster.
The computation can be greatly accelerated by processing many files
in parallel using several compute nodes of a cluster, where a scheduling/queuing
system is used for load balancing. For this, the clusterRun function submits
the computing requests to the scheduler using the run specifications
defined by runWF.
The computational resources are provided as a named list. By default, one can
define the upper time limit in minutes for jobs before they are killed by the
scheduler, the memory limit in Mb, the number of CPUs, and the number of tasks.
The number of independent parallel cluster processes is defined under the
Njobs argument. The following example will run one process in parallel,
using 4 CPU cores for it. If the resources available on a cluster allow running
all the processes simultaneously, then the shown sample submission will utilize
four CPU cores in total (Njobs * ncpus). Note, clusterRun can be used
with most queueing systems as it is based on utilities from the batchtools
package which supports the use of template files (*.tmpl) for defining the
run parameters of different schedulers. To run the following code, one needs to
have both a conf file (see .batchtools.conf.R samples here)
and a template file (see *.tmpl samples here)
for the queueing system available on a system. The following example uses the sample
conf and template files for the Slurm scheduler provided by this package.
library(batchtools)
resources <- list(walltime=120, ntasks=1, ncpus=4, memory=1024)
sal <- clusterRun(sal, FUN = runWF,
more.args = list(),
conffile=".batchtools.conf.R",
template="batchtools.slurm.tmpl",
Njobs=1, runid="01", resourceList=resources)
Note: The example is submitting the job to the short partition. If you desire to
use a different partition, please adjust batchtools.slurm.tmpl accordingly.
The systemPipeR workflow management system allows translating and exporting a
workflow built interactively to R Markdown format or to an executable bash script.
This feature improves the reusability of the workflow, as well as the flexibility
of workflow execution.
The sal2rmd function takes a SYSargsList workflow container and translates it to
an SPR workflow template in R Markdown format. This file can be imported with the
importWF function, as demonstrated above.
sal2rmd(sal)
The sal2bash function takes a SYSargsList workflow container and translates
it to an executable bash script, so one can run the workflow without loading
SPR or using an R console.
sal2bash(sal)
An executable bash script, named spr_wf.sh by default, will be generated in the
project root. In addition, a directory ./spr_wf will be created to store
all the R scripts derived from the workflow steps. Please note that this function will
“collapse” adjacent R steps into one file whenever possible.
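For orientation, the resulting project layout looks roughly like the sketch below (the layout is illustrative; only spr_wf.sh and the ./spr_wf directory names are the documented defaults):

```
project_root/
├── spr_wf.sh   # executable bash script that drives the whole workflow
└── spr_wf/     # R scripts for the workflow steps (adjacent R steps collapsed)
```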
If you desire to resume or restart a project that was initialized in the past,
the SPRproject function allows this operation.
With the resume option, it is possible to load the SYSargsList object in R and
resume the analysis. Please make sure to provide the logs.dir location and the
corresponding YAML file name.
The current working directory needs to be in the project root directory.
sal <- SPRproject(resume = TRUE, logs.dir = ".SPRproject",
sys.file = ".SPRproject/SYSargsList.yml")
If you chose to save the environment in the last analysis, you can recover all
the objects created in that particular session. The SPRproject function allows this
with the load.envir argument. Please note that the environment is saved only if
the workflow was run with saveEnv = TRUE in the last session (runWF()).
sal <- SPRproject(resume = TRUE, load.envir = TRUE)
After loading the workflow in your current session, you can check the objects created in the old environment and decide whether to copy them to the current environment.
viewEnvir(sal)
copyEnvir(sal, list="plot", new.env = globalenv())
This option will keep all previous logs in the folder; however, if you desire to
clean the execution history and restart the workflow, the restart=TRUE option
can be used.
sal <- SPRproject(restart = TRUE, overwrite = TRUE, load.envir = FALSE)
The last and most drastic option of the SPRproject function is to overwrite the
logs and the workflow. This option will delete the hidden folder and the
information in the SYSargsList.yml file. It will not delete any parameter
files or any results created in previous runs. Please use it with caution.
sal <- SPRproject(overwrite = TRUE)
If you desire to import and run an RNA-seq pipeline, please follow:
systemPipeRdata::genWorkenvir(workflow = "rnaseq")
setwd("rnaseq")
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq_importWF.Rmd", verbose = FALSE)
sal <- runWF(sal)
plotWF(sal, rstudio = TRUE)
sal <- renderLogs(sal)
systemPipeShiny (SPS) extends the widely used systemPipeR (SPR) workflow
environment with a versatile graphical user interface provided by a Shiny
App. This allows non-R users, such as experimentalists, to use many of
systemPipeR’s workflow design, control, and visualization functionalities
interactively without requiring knowledge of R.
Most importantly, SPS has been designed as a general purpose framework for
interacting with other R packages in an intuitive manner. Like most Shiny Apps,
SPS can be used on both local computers as well as centralized server-based
deployments that can be accessed remotely as a public web service for using
SPR’s functionalities with community and/or private data. The framework can
integrate many core packages from the R/Bioconductor ecosystem. Examples of
SPS’ current functionalities include:
View our online demo app:
| Type and link | option changed | notes |
|---|---|---|
| Default full installation{blk} | See installation | full app |
| Minimum installation{blk} | See installation | no modules installed |
| Login enabled{blk} | login_screen = TRUE; login_theme = "empty" | no modules installed |
| Login and login themes{blk} | login_screen = TRUE; login_theme = "random" | no modules installed |
| App admin page{blk} | admin_page = TRUE | or simply add “?admin” to the end of the demo URL |
For the demos that require a login, the app account name is “user” and the password is “user”.
For the admin panel login, the account name is “admin” and the password is “admin”.
Please DO NOT delete accounts or change passwords when using the admin features. shinyapps.io resets the app once in a while, but such changes will affect other people trying the demo at the same time.
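For readers who prefer to run SPS locally instead of the hosted demos, a minimal launch sketch is shown below. It assumes systemPipeShiny is installed; spsInit() scaffolds a project directory containing a global.R file that defines the app, which shiny::runApp() then serves:

```r
library(systemPipeShiny)

# Scaffold a new SPS project in the current working directory;
# this creates the config files and a global.R defining the app
spsInit()

# Launch the app from the generated project directory
shiny::runApp()
```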
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /home/dcassol/src/R-4.1.0/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] systemPipeRdata_1.21.3 systemPipeShiny_1.3.2
#> [3] drawer_0.1.0 spsComps_0.3.0
#> [5] spsUtil_0.1.2 shiny_1.6.0
#> [7] systemPipeR_1.27.20 ShortRead_1.51.0
#> [9] GenomicAlignments_1.29.0 SummarizedExperiment_1.23.1
#> [11] Biobase_2.53.0 MatrixGenerics_1.5.1
#> [13] matrixStats_0.60.0 BiocParallel_1.27.2
#> [15] Rsamtools_2.9.1 Biostrings_2.61.1
#> [17] XVector_0.33.0 GenomicRanges_1.45.0
#> [19] GenomeInfoDb_1.29.3 IRanges_2.27.0
#> [21] S4Vectors_0.31.0 BiocGenerics_0.39.1
#> [23] BiocStyle_2.21.3
#>
#> loaded via a namespace (and not attached):
#> [1] backports_1.2.1 GOstats_2.59.0 BiocFileCache_2.1.1
#> [4] lazyeval_0.2.2 shinydashboard_0.7.1 GSEABase_1.55.1
#> [7] splines_4.1.0 crosstalk_1.1.1 ggplot2_3.3.5
#> [10] digest_0.6.27 htmltools_0.5.1.1 magick_2.7.2
#> [13] GO.db_3.13.0 fansi_0.5.0 magrittr_2.0.1
#> [16] checkmate_2.0.0 memoise_2.0.0 BSgenome_1.61.0
#> [19] base64url_1.4 remotes_2.4.0 tzdb_0.1.2
#> [22] limma_3.49.1 shinyFiles_0.9.0 annotate_1.71.0
#> [25] vroom_1.5.3 askpass_1.1 prettyunits_1.1.1
#> [28] jpeg_0.1-9 colorspace_2.0-2 blob_1.2.2
#> [31] rappdirs_0.3.3 xfun_0.24 dplyr_1.0.7
#> [34] crayon_1.4.1 RCurl_1.98-1.3 jsonlite_1.7.2
#> [37] graph_1.71.2 genefilter_1.75.0 brew_1.0-6
#> [40] survival_3.2-11 VariantAnnotation_1.39.0 glue_1.4.2
#> [43] gtable_0.3.0 zlibbioc_1.39.0 DelayedArray_0.19.1
#> [46] Rgraphviz_2.37.0 scales_1.1.1 pheatmap_1.0.12
#> [49] DBI_1.1.1 edgeR_3.35.0 Rcpp_1.0.7
#> [52] viridisLite_0.4.0 xtable_1.8-4 progress_1.2.2
#> [55] bit_4.0.4 DT_0.18 AnnotationForge_1.35.0
#> [58] shinyjqui_0.4.0 htmlwidgets_1.5.3 httr_1.4.2
#> [61] RColorBrewer_1.1-2 shinyAce_0.4.1 ellipsis_0.3.2
#> [64] farver_2.1.0 shinydashboardPlus_2.0.2 pkgconfig_2.0.3
#> [67] XML_3.99-0.6 sass_0.4.0 dbplyr_2.1.1
#> [70] locfit_1.5-9.4 utf8_1.2.2 labeling_0.4.2
#> [73] tidyselect_1.1.1 rlang_0.4.11 later_1.2.0
#> [76] AnnotationDbi_1.55.1 munsell_0.5.0 tools_4.1.0
#> [79] cachem_1.0.5 generics_0.1.0 RSQLite_2.2.7
#> [82] evaluate_0.14 stringr_1.4.0 fastmap_1.1.0
#> [85] yaml_2.2.1 fs_1.5.0 knitr_1.33
#> [88] bit64_4.0.5 purrr_0.3.4 KEGGREST_1.33.0
#> [91] RBGL_1.69.0 mime_0.11 xml2_1.3.2
#> [94] biomaRt_2.49.2 rstudioapi_0.13 debugme_1.1.0
#> [97] compiler_4.1.0 plotly_4.9.4.1 filelock_1.0.2
#> [100] curl_4.3.2 png_0.1-7 tibble_3.1.3
#> [103] bslib_0.2.5.1 stringi_1.7.3 highr_0.9
#> [106] GenomicFeatures_1.45.0 lattice_0.20-44 Matrix_1.3-4
#> [109] styler_1.5.1 shinyjs_2.0.0 vctrs_0.3.8
#> [112] pillar_1.6.2 lifecycle_1.0.0 BiocManager_1.30.16
#> [115] jquerylib_0.1.4 data.table_1.14.0 bitops_1.0-7
#> [118] httpuv_1.6.1 rtracklayer_1.53.0 R6_2.5.0
#> [121] BiocIO_1.3.0 latticeExtra_0.6-29 hwriter_1.3.2
#> [124] bookdown_0.22 promises_1.2.0.1 assertthat_0.2.1
#> [127] openssl_1.4.4 Category_2.59.0 rjson_0.2.20
#> [130] shinyWidgets_0.6.0 withr_2.4.2 batchtools_0.9.15
#> [133] GenomeInfoDbData_1.2.6 parallel_4.1.0 hms_1.1.0
#> [136] grid_4.1.0 tidyr_1.1.3 rmarkdown_2.9
#> [139] shinytoastr_2.1.1 restfulr_0.0.13
This project is funded by NSF award ABI-1661152.